Churned customers are those who have decided to end their relationship with their existing bank. Churn means a direct loss of the marketing acquisition cost already spent on the customer, along with the future revenue that could have been earned. Hence, predicting beforehand which customers are likely to churn can help us avoid this loss.
Dataset Description
Input Attributes
Importing the basic libraries to start understanding the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import recall_score, accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
Reading the Dataset
df = pd.read_csv('ChurnDataF.csv')
df.head(10)
| | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| 5 | 15574012 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 6 | 15592531 | Bartlett | 822 | France | Male | 50 | 7 | 0.00 | 2 | 1 | 1 | 10062.80 | 0 |
| 7 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
| 8 | 15792365 | He | 501 | France | Male | 44 | 4 | 142051.07 | 2 | 0 | 1 | 74940.50 | 0 |
| 9 | 15592389 | H? | 684 | France | Male | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |
df.shape
(10000, 13)
!pip install missingno
Visualizing the missing-value patterns
import missingno as msno
msno.matrix(df)
<Axes: >
There are no missing values present in the dataset
df.skew()
CustomerId          0.001149
CreditScore        -0.071607
Age                 1.011320
Tenure              0.010991
Balance            -0.141109
NumOfProducts       0.745568
HasCrCard          -0.901812
IsActiveMember     -0.060437
EstimatedSalary     0.002085
Exited              1.471611
dtype: float64
Skewness measures the degree to which a distribution is asymmetric. A negative skew means the tail is longer on the left, while a positive skew means the tail is longer on the right. The output above shows the skewness of each feature; values close to 0 indicate a nearly symmetric distribution.
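For instance, a right-skewed feature such as Age can often be made more symmetric with a log transform. A minimal sketch on synthetic data (not the bank dataset):
import numpy as np
import pandas as pd
# synthetic right-skewed "ages": gamma-distributed noise shifted to start at 18
age = pd.Series(np.random.default_rng(0).gamma(shape=2.0, scale=15.0, size=1000) + 18)
print(age.skew())            # clearly positive: right-skewed
print(np.log1p(age).skew())  # typically much closer to 0 after log1p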
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CustomerId       10000 non-null  int64
 1   Surname          10000 non-null  object
 2   CreditScore      10000 non-null  int64
 3   Geography        10000 non-null  object
 4   Gender           10000 non-null  object
 5   Age              10000 non-null  int64
 6   Tenure           10000 non-null  int64
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64
 9   HasCrCard        10000 non-null  int64
 10  IsActiveMember   10000 non-null  int64
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB
df.dtypes
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object
df.describe()  # note the parentheses: df.describe without them returns the bound method, not the summary table
(output: count, mean, std, min, 25%, 50%, 75%, and max for each numeric column)
df.isnull().sum()
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
df.nunique()
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64
From this we can infer that CustomerId is unique to every customer, so it carries no information about the target variable.
df = df.drop(['CustomerId'], axis=1)
df
| | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
| 9999 | Walker | 792 | France | Female | 28 | 4 | 130142.79 | 1 | 1 | 0 | 38190.78 | 0 |
10000 rows × 12 columns
#Find the duplicates
df.duplicated().sum()
0
data = df.copy()
Creating box plots to understand the patterns in the numerical data.
We use box plots to analyse the data features individually, finding each feature's minimum and maximum value to establish the range of the plot and depict the variability of the feature clearly.
To make the analysis more interactive and easier to understand, we can also use Plotly for the bar graphs, histograms, and scatter plots.
import plotly.express as px
fig = px.histogram(df,x='Age',title='Age variation')
fig.show()
# Find the maximum value of the 'Age' column
max_age = df['Age'].max()
print(max_age)
#Find min value
min_age = df['Age'].min()
print(min_age)
92
18
# Univariate analysis of the 'Age' column
# Histogram to visualize the distribution of 'Age'
plt.hist(df['Age'], bins=10,range=[10,100])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()
This plot shows that most customers lie approximately in the 25-60 age range.
# Find the maximum value of the 'CreditScore' column
max_score = df['CreditScore'].max()
print(max_score)
#Min value of the Creditscore
min_score = df['CreditScore'].min()
print(min_score)
850
350
fig = px.histogram(df,x='CreditScore',title='CreditScore variation')
fig.show()
# Univariate analysis of the 'CreditScore' column
# Histogram to visualize the distribution of 'CreditScore'
plt.hist(df['CreditScore'], bins=20,range=[300,900])
plt.xlabel('CreditScore')
plt.ylabel('Frequency')
plt.title('CreditScore Distribution')
plt.show()
CreditScore covers a wide range, from the minimum of 350 up to 850.
# Univariate analysis of the 'Tenure' column
# Histogram to visualize the distribution of 'Tenure'
plt.hist(df['Tenure'], bins=15)
plt.xlabel('Tenure')
plt.ylabel('Frequency')
plt.title('Tenure Distribution')
plt.show()
fig = px.histogram(df,x='Tenure',title='Tenure variation')
fig.show()
# Univariate analysis of the 'NumOfProducts' column
# Histogram to visualize the distribution of 'NumOfProducts'
plt.hist(df['NumOfProducts'],bins=10)
plt.xlabel('NumOfProducts')
plt.ylabel('Frequency')
plt.title('NumOfProducts Distribution')
plt.show()
fig = px.histogram(df,x='NumOfProducts',title='NumOfProducts variation')
fig.show()
# Univariate analysis of the 'HasCrCard' column
# Histogram to visualize the distribution of 'HasCrCard'
plt.hist(df['HasCrCard'],range=[0,1])
plt.xlabel('HasCrCard')
plt.ylabel('Frequency')
plt.title('HasCrCard Distribution')
plt.show()
# Univariate analysis of the 'IsActiveMember' column
# Histogram to visualize the distribution of 'IsActiveMember'
plt.hist(df['IsActiveMember'],range=[0,1])
plt.xlabel('IsActiveMember')
plt.ylabel('Frequency')
plt.title('IsActiveMember Distribution')
plt.show()
Creating bar plots to understand the patterns in the categorical data
# Graphical representation of the geography using bar chart
df['Geography'].value_counts().plot(kind='bar', title='Frequency Distribution Of Geography')
<Axes: title={'center': 'Frequency Distribution Of Geography'}>
# Graphical representation of the gender using bar chart
df['Gender'].value_counts().plot(kind='bar', title='Frequency Distribution Of Gender')
<Axes: title={'center': 'Frequency Distribution Of Gender'}>
fig = px.histogram(df,x='Surname',title='Surname variation')
fig.show()
The counts of people sharing a surname vary over a large scale, ranging from 1 to about 30, but the trend is not uniform. From this we can infer that Surname may not be a strong factor affecting the Exited target variable. We will decide whether to keep or drop it after further bivariate and trivariate analysis.
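One quick way to probe this is to compare churn rates across surname frequencies. A minimal sketch on a temporary copy, so df itself is left untouched (SurnameFreq is a hypothetical helper column):
tmp = df[['Surname', 'Exited']].copy()
tmp['SurnameFreq'] = tmp['Surname'].map(tmp['Surname'].value_counts())  # how common each surname is
print(tmp.groupby('SurnameFreq')['Exited'].mean())  # churn rate at each frequency level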
fig = px.histogram(df,x='EstimatedSalary',title='Salary variation')
fig.show()
All the salary bins have approximately equal counts: the distribution is close to uniform, with no pronounced rise or decline, so customers with different salary values occur in roughly equal numbers.
#df = df.drop(['Surname'] , axis=1)
We will look at box plots of the continuous features to check for outliers and understand their spread.
#Detection of outliers in tenure
sns.boxplot(df['Tenure'])
<Axes: >
All tenure values lie within a narrow range, and there are no outliers in this feature.
#Detection of outliers in Age
sns.boxplot(df['Age'])
<Axes: >
Most customers fall in the 30-50 age range; the outliers lie above the upper whisker, roughly from 60 up to the maximum of 92.
# Detection of outliers in credit score
sns.boxplot(df['CreditScore'])
<Axes: >
Most credit scores lie in the range of roughly 580-720; the outliers sit at the low end, below about 400 (down to the minimum of 350).
# Detection of outliers in balance
sns.boxplot(df['Balance'])
<Axes: >
Balance is continuous over its whole range, with no abnormal values or outliers.
After analysing the individual variables through univariate histograms and box plots, we now examine the class imbalance in the Exited variable.
churn_count = df['Exited'].sum()
retained_count = df.shape[0] - churn_count
churn_ratio = churn_count/ df.shape[0]
retained_ratio = retained_count/ df.shape[0]
churn_ratio
0.2037
retained_ratio
0.7963
The target variable is imbalanced at roughly 80% retained to 20% churned; far fewer customers churn than stay.
Visualizing the data to understand the distribution even better.
labels = ['Churned', 'Retained']
ratios = [churn_ratio, retained_ratio]
colors = ['red', 'green']
plt.pie(ratios, labels=labels, colors=colors)
plt.axis('equal')
plt.title('Customer Churn and Retention Ratio')
plt.show()
# plotting with target feature
sns.countplot(data=data, x='Exited')
plt.title('Count of Churn')
plt.show()
l1 = df.loc[df['Exited'] == 1].shape[0]  # count of churned customers; the comparison must use the int 1 (the string '1' would match nothing)
print(df.Exited.value_counts())
0    7963
1    2037
Name: Exited, dtype: int64
fig = px.histogram(df,x='Exited',title='Churned Customers variation')
fig.show()
The relation between the target and the rest of the variables
## Graphical representation of Gender against Exited
cross_tab = pd.crosstab(df["Gender"],df["Exited"])
cross_tab.plot.bar(rot=0)
plt.title("Effect of Gender on Exited ")
cross_tab
| Exited | 0 | 1 |
|---|---|---|
| Gender | | |
| Female | 3404 | 1139 |
| Male | 4559 | 898 |
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
plt.figure(figsize=(6, 6))
labels =["Exited: yes","Exited:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ['#ff6666', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3)
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Churn Distribution w.r.t Gender: Male(M), Female(F)', fontsize=15, y=1.1)
# show plot
plt.axis('equal')
plt.tight_layout()
plt.show()
type_ = ["0", "1"]
fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Pie(labels=type_, values=df['Exited'].value_counts(), name="Exited"))
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name", textfont_size=16)
fig.update_layout(
title_text="Churn Distributions",
annotations=[dict(text='Churn', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()
## Graphical representation of Creditcard against Exited
cross_tab = pd.crosstab(df['HasCrCard'],df['Exited'])
cross_tab.plot.bar(rot=0)
plt.title("Effect of Has CreditCard on Exited")
cross_tab
| Exited | 0 | 1 |
|---|---|---|
| HasCrCard | | |
| 0 | 2332 | 613 |
| 1 | 5631 | 1424 |
type_ = ["0", "1"]
fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Pie(labels=type_, values=df['HasCrCard'].value_counts(), name="HasCrCard"))
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name", textfont_size=16)
fig.update_layout(
title_text="HasCrCard Distributions",
annotations=[dict(text='HasCrCard', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()
fig = px.histogram(data, x="Exited", color = "HasCrCard", barmode = "group", title = "<b>Customer distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.2)
fig.show()
From these graphs we can read off the churn proportion by credit-card ownership. Using the crosstab above, the churn rate is 613/2945 (about 20.8%) for customers without a credit card and 1424/7055 (about 20.2%) for customers with one, so holding a credit card makes little difference to churn.
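The churn rates quoted above can be computed directly by normalizing the crosstab per row; a minimal sketch:
# churn rate within each HasCrCard group (each row sums to 1)
pd.crosstab(df['HasCrCard'], df['Exited'], normalize='index')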
## Graphical representation of ActiveMember against Exited
cross_tab = pd.crosstab(df['IsActiveMember'],df['Exited'])
cross_tab.plot.bar(rot=0)
plt.title("Effect of isActivemember on Exited")
cross_tab
| Exited | 0 | 1 |
|---|---|---|
| IsActiveMember | | |
| 0 | 3547 | 1302 |
| 1 | 4416 | 735 |
type_ = ["0", "1"]
fig = make_subplots(rows=1, cols=1)
fig.add_trace(go.Pie(labels=type_, values=df['IsActiveMember'].value_counts(), name="IsActiveMember"))
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.7, hoverinfo="label+percent+name", textfont_size=16)
fig.update_layout(
title_text="IsActiveMember Distributions",
annotations=[dict(text='IsActiveMember', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()
fig = px.histogram(data, x="Exited", color = "IsActiveMember", barmode = "group", title = "<b>Customer distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.2)
fig.show()
This plot shows that active members churn far less often than inactive ones: about 27% of inactive members (1302/4849) exited, versus about 14% of active members (735/5151).
## Graphical representation of NumOfProducts against Exited
cross_tab = pd.crosstab(df['NumOfProducts'],df['Exited'])
cross_tab.plot.bar(rot=0)
plt.title("Effect of NumOfProducts on Exited")
cross_tab
| Exited | 0 | 1 |
|---|---|---|
| NumOfProducts | | |
| 1 | 3675 | 1409 |
| 2 | 4242 | 348 |
| 3 | 46 | 220 |
| 4 | 0 | 60 |
fig = px.histogram(data, x="Exited", color = "NumOfProducts", barmode = "group", title = "<b>Customer distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.2)
fig.show()
## Graphical representation of Tenure against Exited
cross_tab = pd.crosstab(df['Tenure'],df['Exited'])
cross_tab.plot.bar(rot=0)
plt.title("Effect of Tenure on Exited")
cross_tab
| Exited | 0 | 1 |
|---|---|---|
| Tenure | | |
| 0 | 318 | 95 |
| 1 | 803 | 232 |
| 2 | 847 | 201 |
| 3 | 796 | 213 |
| 4 | 786 | 203 |
| 5 | 803 | 209 |
| 6 | 771 | 196 |
| 7 | 851 | 177 |
| 8 | 828 | 197 |
| 9 | 771 | 213 |
| 10 | 389 | 101 |
fig = px.histogram(data, x="Exited", color = "Tenure", barmode = "group", title = "<b>Customer distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.2)
fig.show()
## Graphical representation of Geography against Exited
fig = px.histogram(data, x="Exited", color = "Geography", barmode = "group", title = "<b>Customer distribution<b>")
fig.update_layout(width=700, height=500, bargap=0.2)
fig.show()
To depict the relationships between the continuous features and the target, the most suitable plots are box plots, count plots, swarm plots, bar plots, and scatter plots. The features in question are [EstimatedSalary, Age, CreditScore, Surname, Balance].
#Depiction of CreditScore with Exited
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.CreditScore[(df["Exited"] == 0) ],color="Red", fill = True);
ax = sns.kdeplot(df.CreditScore[(df["Exited"] == 1) ],ax =ax, color="Blue", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('CreditScore');
ax.set_title('Distribution of CreditScore by churn');
The two densities overlap heavily: customers with similar credit scores churn and are retained in roughly equal proportions, so CreditScore alone separates the classes poorly.
#Depiction of Estimated Salary with Exited
sns.set_context("paper",font_scale=1.2)
ax = sns.kdeplot(df.EstimatedSalary[(df["Exited"] == 0) ],color="Red", fill = True);
ax = sns.kdeplot(df.EstimatedSalary[(df["Exited"] == 1) ],ax =ax, color="Blue", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Estimated Salary');
ax.set_title('Distribution of Salary by churn');
#Depiction of Age with Exited
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.Age[(df["Exited"] == 0) ],
color="Red", fill = True);
ax = sns.kdeplot(df.Age[(df["Exited"] == 1) ],
ax =ax, color="Blue", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Age');
ax.set_title('Distribution of Age by churn');
Customers who churn span roughly the 20-70 age range and are far fewer in number than those who do not churn; their density also peaks at a noticeably higher age.
Handling of the missing values
No missing values were present, so no imputation was needed.
Label Encoding of the Data
Geography and Gender are categorical features that need to be encoded into numerical values.
Feature Selection
Selection of the features that affect the target variable more than the others and have less bias.
#Converting the Categorical data into Numerical Data
from sklearn.preprocessing import LabelEncoder
# Converting categorical columns to numerical columns
encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])
df["Geography"] = encoder.fit_transform(df["Geography"])
df['Surname'] = encoder.fit_transform(df["Surname"])
df.head()
| | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1115 | 619 | 0 | 0 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 1177 | 608 | 2 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 2040 | 502 | 0 | 0 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 289 | 699 | 0 | 0 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 1822 | 850 | 2 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
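Note that label encoding imposes an artificial order on Geography (0 < 1 < 2) even though the countries are unordered; the notebook keeps the label-encoded version. As a sketch, a one-hot alternative (normally applied to the raw string column before any label encoding) would look like:
# one-hot encode Geography instead of label encoding it (sketch, not applied here)
df_onehot = pd.get_dummies(df, columns=['Geography'], drop_first=True)
df_onehot.head()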
#Depiction of Surname with Exited
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.Surname[(df["Exited"] == 0) ],
color="Red", fill = True);
ax = sns.kdeplot(df.Surname[(df["Exited"] == 1) ],
ax =ax, color="Blue", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Surname');
ax.set_title('Distribution of Surname by churn');
#Depiction of Geography with Exited
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.Geography[(df["Exited"] == 0) ],
color="Red", fill = True);
ax = sns.kdeplot(df.Geography[(df["Exited"] == 1) ],
ax =ax, color="Blue", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Geography');
ax.set_title('Distribution of Geography by churn');
#Depiction of Gender with Exited
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.Gender[(df["Exited"] == 0) ],
color="Red", fill = True);
ax = sns.kdeplot(df.Gender[(df["Exited"] == 1) ],
ax =ax, color="Blue", fill= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel('Density');
ax.set_xlabel('Gender');
ax.set_title('Distribution of Gender by churn');
Here we depicted the relationship between the label-encoded categorical features and the Exited variable.
plt.figure(figsize=(15,8))
data.corr()['Exited'].sort_values(ascending = False).plot(kind='bar')
<Axes: >
We can see that Age has the correlation closest to +1 of all the features: as Age increases, the likelihood of a customer exiting also increases, so it has a strong influence on churn.
CreditScore, NumOfProducts, and IsActiveMember have negative correlations, suggesting that as these variables increase the likelihood of exiting decreases; they are associated with customer retention.
EstimatedSalary, HasCrCard, and Tenure have correlations close to 0, indicating a weak linear relationship with churn; changes in these variables do not strongly affect the likelihood of exiting.
Overall, the features with the highest positive correlation, Age and Balance, are the most likely to drive the Exited variable, while the features with negative correlation drive retention.
# Finding the correlation between the independent and dependent feature
plt.figure(figsize=(20, 9))
sns.heatmap(data.corr(), annot=True)
<Axes: >
The red blocks show high positive correlation, most notably for Age; the other blocks show negative or weak correlation.
# splitting dataset into dependent and independent feature
X = df.drop(columns='Exited')
y = df['Exited']
X.head()
| | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1115 | 619 | 0 | 0 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 |
| 1 | 1177 | 608 | 2 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 |
| 2 | 2040 | 502 | 0 | 0 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 |
| 3 | 289 | 699 | 0 | 0 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 |
| 4 | 1822 | 850 | 2 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 |
# select the k highest-scoring features against the target (ANOVA F-test by default)
selection = SelectKBest() # k=10 default
X = selection.fit_transform(X,y)
# get_support() marks the kept features as True and the removed ones as False
selection.get_support()
array([ True, True, True, True, True, True, True, True, False,
True, True])
According to the feature selection, 10 of the 11 features are kept. The one feature marked False, and therefore removed, is HasCrCard; the selected features are [Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, IsActiveMember, EstimatedSalary].
SelectKBest, imported from sklearn's feature_selection module, scores each feature against the target and keeps the k highest-scoring ones.
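To see why HasCrCard was the one dropped, we can inspect the fitted selector's per-feature scores; a minimal sketch, where feature_names is reconstructed from the columns of X above:
feature_names = ['Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure',
                 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
scores = pd.Series(selection.scores_, index=feature_names).sort_values(ascending=False)
print(scores)  # the feature with the lowest score is the one SelectKBest discards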
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 4, stratify =y)
X_train.shape
(7000, 10)
X_test.shape
(3000, 10)
# it's an imbalanced dataset
y.value_counts()
0    7963
1    2037
Name: Exited, dtype: int64
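SMOTEENN was imported at the top but never applied. A minimal sketch of how it could be used to rebalance the classes, applied to the training split only so that the test set stays untouched:
sm = SMOTEENN(random_state=0)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())  # class counts after combined over- and under-sampling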
Before moving forward: the bivariate analysis showed that the features span very different ranges, so the dataset requires standard scaling.
def distplot(feature, frame, color='r'):
    plt.figure(figsize=(8,3))
    plt.title("Distribution for {}".format(feature))
    # seaborn's distplot is deprecated; histplot with a KDE overlay is the modern equivalent
    ax = sns.histplot(frame[feature], color=color, kde=True)
col = ["CreditScore", 'Age', 'Balance', 'EstimatedSalary']
for feature in col:
    distplot(feature, data)
The features need standard scaling, as they are distributed over very different ranges.
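StandardScaler transforms each feature x to z = (x - mean) / std, computed column by column. A minimal check on a single column (a sketch):
mu, sigma = df['Age'].mean(), df['Age'].std(ddof=0)  # StandardScaler uses the population std (ddof=0)
z = (df['Age'] - mu) / sigma
print(round(z.mean(), 6), round(z.std(ddof=0), 6))  # approximately 0 and 1 after scaling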
data_std = pd.DataFrame(StandardScaler().fit_transform(df[col]).astype('float64'), columns = col)
for feat in col: distplot(feat, data_std, color='c')
df.columns
Index(['Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure',
'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
'EstimatedSalary', 'Exited'],
dtype='object')
for i in df.columns:
    print(i, ": ", df[i].unique())
Surname : [1115 1177 2040 ... 1366 44 363] CreditScore : [619 608 502 699 850 645 822 376 501 684 528 497 476 549 635 616 653 587 726 732 636 510 669 846 577 756 571 574 411 591 533 553 520 722 475 490 804 582 472 465 556 834 660 776 829 637 550 698 585 788 655 601 656 725 511 614 742 687 555 603 751 581 735 661 675 738 813 657 604 519 664 678 757 416 665 777 543 506 493 652 750 729 646 647 808 524 769 730 515 773 814 710 413 623 670 622 785 605 479 685 538 562 721 628 668 828 674 625 432 770 758 795 686 789 589 461 584 579 663 682 793 691 485 650 754 535 716 539 706 586 631 717 800 683 704 615 667 484 480 578 512 606 597 778 514 525 715 580 807 521 759 516 711 618 643 671 689 620 676 572 695 592 567 694 547 594 673 610 767 763 712 703 662 659 523 772 545 634 739 771 681 544 696 766 727 693 557 531 498 651 791 733 811 707 714 782 775 799 602 744 588 747 583 627 731 629 438 642 806 474 559 429 680 749 734 644 626 649 805 718 840 630 654 762 568 613 522 737 648 443 640 540 460 593 801 611 802 745 483 690 492 709 705 560 752 701 537 487 596 702 486 724 548 464 790 534 748 494 590 468 509 818 816 536 753 774 621 569 658 798 641 542 692 639 765 570 638 599 632 779 527 564 833 504 842 508 417 598 741 607 761 848 546 439 755 760 526 713 700 666 566 495 688 612 477 427 839 819 720 459 503 624 529 563 482 796 445 746 786 554 672 787 499 844 450 815 838 803 736 633 600 679 517 792 743 488 421 841 708 507 505 456 435 561 518 565 728 784 552 609 764 697 723 551 444 719 496 541 830 812 677 420 595 617 809 500 826 434 513 478 797 363 399 463 780 452 575 837 794 824 428 823 781 849 489 431 457 768 831 359 820 573 576 558 817 449 440 415 821 530 350 446 425 740 481 783 358 845 451 458 469 423 404 836 473 835 466 491 351 827 843 365 532 414 453 471 401 810 832 470 447 422 825 430 436 426 408 847 418 437 410 454 407 455 462 386 405 383 395 467 433 442 424 448 441 367 412 382 373 419] Geography : [0 2 1] Gender : [0 1] Age : [42 41 39 43 44 50 29 27 31 24 34 25 35 45 58 32 38 46 36 33 40 51 61 49 37 19 66 56 26 21 55 75 22 30 28 65 48 52 57 73 47 54 72 20 67 79 62 53 80 59 68 23 60 70 63 64 18 82 69 74 71 76 77 88 85 84 78 81 92 83] Tenure : [ 2 1 8 7 4 6 3 10 5 9 0] Balance : [ 0. 83807.86 159660.8 ... 57369.61 75075.31 130142.79] NumOfProducts : [1 3 2 4] HasCrCard : [1 0] IsActiveMember : [1 0] EstimatedSalary : [101348.88 112542.58 113931.57 ... 42085.58 92888.52 38190.78] Exited : [1 0]
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # transform the test set with the scaler fitted on the training set, to avoid leakage
!pip install xgboost
!pip install catboost
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (train_test_split, cross_val_score, GridSearchCV,
                                     ShuffleSplit, KFold)
from sklearn import feature_selection, model_selection, metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, fbeta_score,
                             confusion_matrix, classification_report, roc_curve, roc_auc_score,
                             auc, precision_recall_curve, average_precision_score, make_scorer,
                             log_loss)
from xgboost import XGBClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Baseline classification models to compare
models = []
models.append(('Logistic Regression', LogisticRegression(solver='liblinear', random_state = 0, class_weight='balanced')))
models.append(('SVC', SVC(kernel = 'linear', random_state = 0)))
models.append(('Kernel SVM', SVC(kernel = 'rbf', random_state = 0)))
models.append(('KNN', KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)))
models.append(('Gaussian NB', GaussianNB()))
models.append(('Decision Tree Classifier', DecisionTreeClassifier(criterion = 'entropy', random_state = 0)))
models.append(('Random Forest', RandomForestClassifier(n_estimators=100, criterion = 'entropy', random_state = 0)))
models.append(("Adaboost", AdaBoostClassifier()))
models.append(("Gradient boost classifier", GradientBoostingClassifier()))
acc_results =[]
auc_results =[]
names = []
result_col = ["Algorithm", "ROC AUC Mean", "ROC AUC STD", "Accuracy Mean", "Accuracy STD"]
model_results = pd.DataFrame(columns = result_col)
i=0
# K- fold cross validation
for name, model in models:
    names.append(name)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
    cv_acc_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    cv_auc_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring="roc_auc")
    acc_results.append(cv_acc_results)
    auc_results.append(cv_auc_results)
    model_results.loc[i] = [name,
                            round(cv_auc_results.mean()*100, 2),
                            round(cv_auc_results.std()*100, 2),
                            round(cv_acc_results.mean()*100, 2),
                            round(cv_acc_results.std()*100, 2)]
    i += 1
model_results.sort_values(by = ['ROC AUC Mean'], ascending=False)
| | Algorithm | ROC AUC Mean | ROC AUC STD | Accuracy Mean | Accuracy STD |
|---|---|---|---|---|---|
| 8 | Gradient boost classifier | 86.16 | 1.84 | 86.07 | 1.04 |
| 6 | Random Forest | 84.97 | 1.66 | 86.17 | 1.11 |
| 7 | Adaboost | 84.55 | 1.97 | 85.33 | 1.32 |
| 2 | Kernel SVM | 82.89 | 2.37 | 85.63 | 1.13 |
| 4 | Gaussian NB | 80.68 | 1.65 | 82.67 | 1.60 |
| 3 | KNN | 78.74 | 1.32 | 83.54 | 1.12 |
| 0 | Logistic Regression | 75.75 | 1.47 | 70.01 | 1.06 |
| 5 | Decision Tree Classifier | 68.16 | 1.73 | 79.10 | 0.96 |
| 1 | SVC | 64.54 | 5.57 | 79.63 | 1.50 |
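Since the gradient boosting classifier tops the table on ROC AUC, it is worth plotting its ROC curve on the held-out test set. A minimal sketch, fitting a fresh model here rather than reusing one from the loop above:
gbc = GradientBoostingClassifier().fit(X_train, y_train)
proba = gbc.predict_proba(X_test)[:, 1]  # predicted churn probability for each test customer
fpr, tpr, _ = roc_curve(y_test, proba)
plt.plot(fpr, tpr, label="AUC = %.3f" % roc_auc_score(y_test, proba))
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()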
fig = plt.figure(figsize=(25,15))
ax = fig.add_subplot(111)
plt.boxplot(acc_results)
ax.set_xticklabels(names)
plt.ylabel('Accuracy Score\n',
           horizontalalignment="center",fontstyle = "normal",
           fontsize = "large", fontfamily = "sans-serif")
plt.xlabel('\n Baseline Classification Algorithms\n',
horizontalalignment="center",fontstyle = "normal",
fontsize = "large", fontfamily = "sans-serif")
plt.title('Accuracy Score Comparison \n',
horizontalalignment="center", fontstyle = "normal",
fontsize = "22", fontfamily = "sans-serif")
plt.xticks(rotation=0, horizontalalignment="center")
plt.yticks(rotation=0, horizontalalignment="right")
plt.show()
score_array = []
for each in range(1,25):
    knn_loop = KNeighborsClassifier(n_neighbors = each)
    knn_loop.fit(X_train,y_train)
    score_array.append(knn_loop.score(X_test,y_test))
score_array
[0.7906666666666666, 0.8226666666666667, 0.821, 0.8316666666666667, 0.8366666666666667, 0.8353333333333334, 0.84, 0.8366666666666667, 0.843, 0.8353333333333334, 0.8366666666666667, 0.834, 0.837, 0.833, 0.8336666666666667, 0.832, 0.8326666666666667, 0.834, 0.836, 0.8316666666666667, 0.8353333333333334, 0.8333333333333334, 0.8343333333333334, 0.8323333333333334]
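To read the best k off this list programmatically, a small sketch (note the +1 offset, because the loop started at k = 1):
best_k = int(np.argmax(score_array)) + 1
print(best_k, max(score_array))  # k = 9 gives the highest test accuracy (0.843) in the run above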
Visualizing the relationship between the number of neighbors (k) and the test accuracy for the k-nearest neighbors (KNN) algorithm.
fig = plt.figure(figsize=(15, 7))
plt.plot(range(1,25),score_array, color = '#ec838a')
plt.ylabel('Test accuracy\n',horizontalalignment="center",fontstyle = "normal", fontsize = "large", fontfamily = "sans-serif")
plt.xlabel('Number of neighbors (k)\n',horizontalalignment="center",fontstyle = "normal", fontsize = "large", fontfamily = "sans-serif")
plt.title('Optimal Number of K Neighbors \n',horizontalalignment="center", fontstyle = "normal",fontsize = "22", fontfamily = "sans-serif")
#plt.legend(loc='top right', fontsize = "medium")
plt.xticks(rotation=0, horizontalalignment="center")
plt.yticks(rotation=0, horizontalalignment="right")
plt.show()
score_array = []
for each in range(1,100):
    rf_loop = RandomForestClassifier(n_estimators = each, random_state = 1)
    rf_loop.fit(X_train,y_train)
    score_array.append(rf_loop.score(X_test,y_test))
for i,j in enumerate(score_array):
    print(i+1,":",j)
1 : 0.785 2 : 0.8223333333333334 3 : 0.8183333333333334 4 : 0.8333333333333334 5 : 0.833 6 : 0.842 7 : 0.8406666666666667 8 : 0.843 9 : 0.844 10 : 0.8456666666666667 11 : 0.8483333333333334 12 : 0.85 13 : 0.8486666666666667 14 : 0.8516666666666667 15 : 0.852 16 : 0.8513333333333334 17 : 0.85 18 : 0.8506666666666667 19 : 0.8516666666666667 20 : 0.852 21 : 0.8556666666666667 22 : 0.854 23 : 0.855 24 : 0.855 25 : 0.855 26 : 0.8576666666666667 27 : 0.8556666666666667 28 : 0.855 29 : 0.8556666666666667 30 : 0.855 31 : 0.855 32 : 0.8543333333333333 33 : 0.856 34 : 0.8543333333333333 35 : 0.854 36 : 0.855 37 : 0.8563333333333333 38 : 0.8566666666666667 39 : 0.8573333333333333 40 : 0.857 41 : 0.856 42 : 0.8573333333333333 43 : 0.8566666666666667 44 : 0.8573333333333333 45 : 0.8563333333333333 46 : 0.8566666666666667 47 : 0.856 48 : 0.8573333333333333 49 : 0.8576666666666667 50 : 0.8576666666666667 51 : 0.857 52 : 0.8566666666666667 53 : 0.857 54 : 0.8573333333333333 55 : 0.8566666666666667 56 : 0.857 57 : 0.857 58 : 0.8573333333333333 59 : 0.858 60 : 0.8566666666666667 61 : 0.858 62 : 0.8576666666666667 63 : 0.857 64 : 0.8563333333333333 65 : 0.8573333333333333 66 : 0.8576666666666667 67 : 0.858 68 : 0.8583333333333333 69 : 0.8576666666666667 70 : 0.859 71 : 0.8576666666666667 72 : 0.8586666666666667 73 : 0.8593333333333333 74 : 0.8586666666666667 75 : 0.858 76 : 0.8583333333333333 77 : 0.8593333333333333 78 : 0.8583333333333333 79 : 0.8576666666666667 80 : 0.859 81 : 0.8586666666666667 82 : 0.8586666666666667 83 : 0.858 84 : 0.8583333333333333 85 : 0.858 86 : 0.859 87 : 0.8573333333333333 88 : 0.8586666666666667 89 : 0.8573333333333333 90 : 0.8583333333333333 91 : 0.858 92 : 0.8596666666666667 93 : 0.858 94 : 0.8596666666666667 95 : 0.8583333333333333 96 : 0.8603333333333333 97 : 0.8593333333333333 98 : 0.8586666666666667 99 : 0.8593333333333333
fig = plt.figure(figsize=(15, 7))
plt.plot(range(1,100),score_array, color = '#ec838a')
plt.ylabel('Test accuracy\n',horizontalalignment="center",
           fontstyle = "normal", fontsize = "large",
           fontfamily = "sans-serif")
plt.xlabel('Number of trees (n_estimators)\n',horizontalalignment="center",
           fontstyle = "normal", fontsize = "large",
           fontfamily = "sans-serif")
plt.title('Optimal Number of Trees for Random Forest Model \n',horizontalalignment="center", fontstyle = "normal", fontsize = "22", fontfamily = "sans-serif")
#plt.legend(loc='top right', fontsize = "medium")
plt.xticks(rotation=0, horizontalalignment="center")
plt.yticks(rotation=0, horizontalalignment="right")
plt.show()
#evaluation of results
def model_evaluation(y_test, y_pred, model_name):
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    f2 = fbeta_score(y_test, y_pred, beta = 2.0)
    results = pd.DataFrame([[model_name, acc, prec, rec, f1, f2]],
                           columns = ["Model", "Accuracy", "Precision", "Recall",
                                      "F1 Score", "F2 Score"])
    results = results.sort_values(["Precision", "Recall", "F2 Score"], ascending = False)
    return results
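The F2 score computed in model_evaluation weights recall beta-squared times (here four times) as heavily as precision, which suits churn prediction, where missing a churner costs more than a false alarm. A small worked check of the formula (a sketch):
def f_beta(p, r, beta=2.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f_beta(0.8, 0.4))  # ~0.444, below the F1 of ~0.533 because recall is the weaker side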
# Logistic regression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#SVC
classifier2 = SVC(kernel = 'linear', random_state = 0)
classifier2.fit(X_train, y_train)
y_pred2 = classifier2.predict(X_test)
#knn
classifier3 = KNeighborsClassifier(n_neighbors=22, metric="minkowski", p=2)
classifier3.fit(X_train, y_train)
y_pred3 = classifier3.predict(X_test)
#Kernel SVM
classifier4 = SVC(kernel="rbf", random_state =0)
classifier4.fit(X_train, y_train)
y_pred4 = classifier4.predict(X_test)
#Naive Bayes
classifier5 = GaussianNB()
classifier5.fit(X_train, y_train)
y_pred5 = classifier5.predict(X_test)
#Decision tree
classifier6 = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier6.fit(X_train, y_train)
y_pred6 = classifier6.predict(X_test)
#Random Forest
classifier7 = RandomForestClassifier(n_estimators=72, criterion="entropy", random_state=0)
classifier7.fit(X_train, y_train)
y_pred7 = classifier7.predict(X_test)
#Adaboost
classifier8 = AdaBoostClassifier()
classifier8.fit(X_train, y_train)
y_pred8 = classifier8.predict(X_test)
#Gradient Boost
classifier9 = GradientBoostingClassifier()
classifier9.fit(X_train, y_train)
y_pred9 = classifier9.predict(X_test)
lr = model_evaluation(y_test, y_pred, "Logistic Regression")
svm = model_evaluation(y_test, y_pred2, "SVM (Linear)")
knn = model_evaluation(y_test, y_pred3, "K-Nearest Neighbours")
k_svm = model_evaluation(y_test, y_pred4, "Kernel SVM")
nb = model_evaluation(y_test, y_pred5, "Naive Bayes")
dt = model_evaluation(y_test, y_pred6, "Decision Tree")
rf = model_evaluation(y_test, y_pred7, "Random Forest")
ab = model_evaluation(y_test, y_pred8, "Adaboost")
gb = model_evaluation(y_test, y_pred9, "Gradient Boost")
C:\Users\ARUNIMA\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
# pandas deprecated DataFrame.append; pd.concat is the replacement
eval_ = pd.concat([lr, svm, knn, k_svm, nb, dt, rf, ab, gb]).sort_values(
    ["Precision", "Recall", "F2 Score"], ascending=False).reset_index(drop=True)
eval_
| | Model | Accuracy | Precision | Recall | F1 Score | F2 Score |
|---|---|---|---|---|---|---|
| 0 | Kernel SVM | 0.852333 | 0.800000 | 0.366612 | 0.502806 | 0.411160 |
| 1 | K-Nearest Neighbours | 0.833333 | 0.800000 | 0.242226 | 0.371859 | 0.281476 |
| 2 | Gradient Boost | 0.862667 | 0.774105 | 0.459902 | 0.577002 | 0.500534 |
| 3 | Random Forest | 0.857667 | 0.765896 | 0.433715 | 0.553814 | 0.474910 |
| 4 | Naive Bayes | 0.830667 | 0.751220 | 0.252046 | 0.377451 | 0.290676 |
| 5 | Adaboost | 0.852000 | 0.714653 | 0.454992 | 0.556000 | 0.490646 |
| 6 | Logistic Regression | 0.808000 | 0.594595 | 0.180033 | 0.276382 | 0.209205 |
| 7 | Decision Tree | 0.790667 | 0.486486 | 0.500818 | 0.493548 | 0.497885 |
| 8 | SVM (Linear) | 0.796333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
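The zero precision and recall for SVM (Linear) match the UndefinedMetricWarning above: on the scaled test set this model predicts no positives at all. A quick check (a sketch):
print(np.unique(y_pred2, return_counts=True))  # expect only class 0 if no churners are predicted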
# confusion matrix for each model, paired with its name in training order
# (zipping against the sorted eval_ table would mislabel the plots)
predictions = [("Logistic Regression", y_pred), ("SVM (Linear)", y_pred2),
               ("K-Nearest Neighbours", y_pred3), ("Kernel SVM", y_pred4),
               ("Naive Bayes", y_pred5), ("Decision Tree", y_pred6),
               ("Random Forest", y_pred7), ("Adaboost", y_pred8),
               ("Gradient Boost", y_pred9)]
for name, pred in predictions:
    plt.figure(figsize=(4,3))
    sns.heatmap(confusion_matrix(y_test, pred),
                annot=True, fmt = "d", linecolor="k", linewidths=3)
    plt.title(name, fontsize=14)
    plt.show()
k-Fold Cross-Validation: Model evaluation is most commonly done with the k-fold cross-validation technique, which primarily helps us assess variance. A variance problem occurs when the model scores well on the training set and one test set, but the accuracy looks different on another test set. To address this, k-fold cross-validation splits the training set into 10 folds and trains the model on 9 of them before testing on the remaining fold. This lets us train the model on all ten combinations of 9 folds, giving a much more stable estimate of performance.
def k_fold_cross_validation(classifier_name, name):
    accuracies = cross_val_score(estimator=classifier_name,
                                 X=X_train, y=y_train, cv=10)
    print(name, "accuracy: %0.2f (+/- %0.2f)" % (accuracies.mean(), accuracies.std() * 2))
k_fold_cross_validation(classifier8, "Adaboost")
Adaboost accuracy: 0.86 (+/- 0.02)
k_fold_cross_validation(classifier, "Logistic regression")
Logistic regression accuracy: 0.81 (+/- 0.02)
k_fold_cross_validation(classifier9, "Gradient Boost classifier")
Gradient Boost classifier accuracy: 0.86 (+/- 0.02)
k_fold_cross_validation(classifier7, "Random Forest classifier")
Random Forest classifier accuracy: 0.86 (+/- 0.02)
k_fold_cross_validation(classifier6, "Decision Tree classifier")
Decision Tree classifier accuracy: 0.79 (+/- 0.04)
k_fold_cross_validation(classifier4, "Kernel SVM")
Kernel SVM accuracy: 0.86 (+/- 0.02)
k_fold_cross_validation(classifier3, "Knn")
Knn accuracy: 0.84 (+/- 0.02)